1. Train_test_split
Underfitting/High Bias: The hypothesis function h maps poorly to the trend of the data.
Reasons:
1. The function is too simple
2. It uses too few features
Overfitting/High Variance: The hypothesis function fits the available training data well but does not generalize to new data. Two ways to address it:
1. Reduce the number of features:
a. Manually select which features to keep
b. Use model selection algorithm
2. Regularization
a. Keep all the features, but reduce the magnitude of the parameters θj (see the sketch below)
b. Regularization works well when we have a lot of slightly useful features
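As a quick illustration of point 2, here is a minimal sketch (not part of the case studies below) using scikit-learn's Ridge, an L2-regularized linear model; the synthetic data and the alpha value are made-up assumptions, chosen only to show how regularization shrinks coefficient magnitudes.
# Minimal sketch: L2 regularization shrinks coefficient magnitudes.
# The synthetic data and alpha value are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 10))            # 50 observations, 10 features
y = X[:, 0] * 3 + rng.normal(size=50)    # only the first feature is truly useful
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # larger alpha -> stronger shrinkage of the θj
print('OLS   sum of |coef| =', np.abs(ols.coef_).sum())
print('Ridge sum of |coef| =', np.abs(ridge.coef_).sum())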
What is Linear Regression?
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables.
$Y = \beta_0 + \beta_1 X + \epsilon$
1. Y - the variable we predict
2. X - the variable we use to make the prediction
3. β0 - the intercept term; the predicted value of Y when X = 0
4. β1 - the slope term; the change in Y when X changes by 1 unit
5. ε - the residual term, i.e. the difference between the actual and predicted values
6. Error reduction techniques
a. Ordinary Least Squares (OLS) - minimizes ∑[Actual(y) - Predicted(y')]²
Why OLS?
i. It uses squared error, which has nice mathematical properties, making it easier to differentiate and to run gradient descent
ii. OLS is easy to analyze and computationally fast, i.e. it can be quickly applied to data sets with thousands of features
iii. The interpretation of OLS is much easier than that of other regression techniques
b. Generalized Least Square
c. Percentage Least Square
d. Total Least Square
e. Least absolute deviation
Formula for calculating the coefficients
$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$ where n is the number of observations
$\beta_0 = \bar{y} - \beta_1 \bar{x}$
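A minimal NumPy sketch of these two formulas; the x and y arrays are made-up values, not from the housing data below.
# Minimal sketch of the closed-form OLS coefficients above (x and y are made-up values)
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
beta0 = y.mean() - beta1 * x.mean()                                            # intercept
print('beta1 =', beta1)
print('beta0 =', beta0)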
In [196]:
# Case Study : Predicting Housing Price
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [197]:
df = pd.read_csv("C:/Users/melvin/Machine Learning/Linear Regression/USA_Housing.csv")
In [198]:
# Summary
df.head()
Out[198]:
In [191]:
df.info()
In [192]:
# Validating Linear Regression Assumptions
df.describe()
Out[192]:
In [193]:
df.columns
Out[193]:
In [194]:
sns.pairplot(df)
Out[194]:
In [195]:
sns.distplot(df['Price'])
Out[195]:
In [89]:
sns.heatmap(df.corr(),annot=True)
Out[89]:
1. There exists a linear and additive relationship between the dependent variable (DV) and the independent variables (IVs)
2. No multicollinearity - absence of correlation between the independent variables
3. Homoscedasticity - the error terms have constant variance (absence of heteroscedasticity)
4. No autocorrelation - absence of correlation among the error terms
5. The dependent variable and the error terms should follow a normal distribution
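One common way to check assumption 2 is the variance inflation factor (VIF). The sketch below is not part of the original notebook; it assumes statsmodels is installed and uses the numeric feature columns of this dataset.
# Minimal sketch (assumes statsmodels is installed): VIF as a multicollinearity check.
# A VIF much larger than ~5-10 is a common warning sign.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
features = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
exog = sm.add_constant(features)   # VIF should be computed with an intercept column
for i, col in enumerate(exog.columns):
    if col != 'const':
        print(col, variance_inflation_factor(exog.values, i))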
In [181]:
sns.pairplot(df)
Out[181]:
In [17]:
df.columns
Out[17]:
In [100]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms', 'Area Population']]
y= df['Price']
In [113]:
# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))
In [28]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
In [33]:
lm.fit(X_train,y_train)
print('Intercept = ', lm.intercept_ ,'Coefficients = ',lm.coef_)
In [138]:
import seaborn as sns
anscombe = sns.load_dataset("anscombe")  # example dataset (not used below)
# Synthetic data to illustrate a residual plot: residuals should scatter randomly around zero
rs = np.random.RandomState(7)
x = rs.normal(2, 1, 75)
y = 2 + 1.5 * x + rs.normal(0, 2, 75)
sns.residplot(x=x, y=y, lowess=True)
Out[138]:
In [91]:
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
boston.feature_names
Out[91]:
In [152]:
'''
import statsmodels.api as sm
model = sm.OLS(y_train,X_train).fit()
predictions = model.predict(X_test)
# Print out the statistics
model.summary()
'''
Out[152]:
In [169]:
X = boston.data
y = boston.target
# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
print('R-Square =',(lm.score(X_train,y_train) * 100))
In [182]:
from sklearn import metrics
predictions = lm.predict(X_test)
accuracy = metrics.r2_score(y_test, predictions)
print('R-Square', accuracy * 100)
In [176]:
plt.scatter(y_test,predictions)
Out[176]:
In [118]:
import seaborn as sns
sns.distplot((y_test-predictions))
Out[118]:
In [187]:
from sklearn import metrics
print('MAE  =', metrics.mean_absolute_error(y_test, predictions))
print('MSE  =', metrics.mean_squared_error(y_test, predictions))
print('RMSE =', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('Explained Variance =', metrics.explained_variance_score(y_test, predictions))
Error metrics are the crucial evaluation numbers we must check. Since all of these measure error, the lower the number, the better the model. Let's look at them one by one:
MSE - This is the mean squared error. It tends to amplify the impact of outliers on the model's accuracy. For example, suppose the actual y is 10 and the predicted y is 30; the resulting squared error would be (30-10)² = 400.
MAE - This is the mean absolute error. It is robust against the effect of outliers. Using the previous example, the resulting absolute error would be |30-10| = 20.
RMSE - This is the root mean squared error. It is interpreted as how far, on average, the residuals are from zero. It undoes the squaring in MSE by taking the square root and returns the result in the same units as the data. Here, the resulting RMSE would be √(30-10)² = 20. Don't be baffled when you see the same value for MAE and RMSE; usually we calculate these numbers after summing over all the (actual - predicted) values in the data.
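A minimal sketch of the three metrics using the single 10-vs-30 example from the text; in practice they are averaged over all observations, as the sklearn calls above do.
# Minimal sketch: MAE, MSE and RMSE on the 10 vs. 30 example from the text
import numpy as np
actual = np.array([10.0])
predicted = np.array([30.0])
residuals = actual - predicted
mae = np.mean(np.abs(residuals))   # 20
mse = np.mean(residuals ** 2)      # 400
rmse = np.sqrt(mse)                # 20
print('MAE =', mae, 'MSE =', mse, 'RMSE =', rmse)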
Logistic Regression belongs to the family of generalized linear models. It is a binary classification algorithm used when the response variable is dichotomous (1 or 0).
Examples:
1. Ham/Spam
2. Loan Defaulters(Yes/No)
3. Disease Diagnosis
1. The response variable must follow a binomial distribution
2. Logistic Regression assumes a linear relationship between the independent variables and the link function (logit)
3. The dependent variable should have mutually exclusive and exhaustive categories
Note: A plain linear function can produce probabilities outside the [0,1] interval, making them invalid predictions; the logit link fixes this (see the sketch below)
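A minimal sketch of the logistic (sigmoid) link, which squashes any linear score into (0,1) and therefore yields valid probabilities; the coefficient and x values here are made up.
# Minimal sketch: the sigmoid maps any linear score into (0, 1).
# beta0, beta1 and the x values are made-up illustrations.
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
beta0, beta1 = -1.0, 0.8
x = np.array([-10.0, 0.0, 2.5, 10.0])
linear_score = beta0 + beta1 * x       # can be any real number
probability = sigmoid(linear_score)    # always strictly between 0 and 1
print(linear_score)
print(probability)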
1. Multinomial Logistic Regression
2. Ordinal Logistic Regression
Multinomial logistic regression handles the multi-class problem by fitting K-1 independent binary logistic classifier models.
Drawback:
a. It doesn't scale well in the presence of a large number of target classes
b. Requires a larger dataset to achieve reasonable accuracy
Ordinal logistic regression is used when the target variable is ordinal in nature (e.g. years of work experience: 5 > 4 > 3 > 2 > 1). It builds a single model with multiple threshold values.
If we have K classes, the model requires K-1 thresholds or cutoff points. It also makes the important assumption of proportional odds: on the logit (S-shaped) scale, all of the thresholds lie on a straight line.
Note: Logistic regression is not a great choice for multi-class problems, but it's good to be aware of these variants. In this tutorial we'll focus on logistic regression for the binary classification task.
a. A unit change in an input feature doesn't affect the predicted probability directly; it changes the log-odds, and hence the odds ratio (see the sketch below)
b. We use the maximum likelihood method to determine the best coefficients and eventually a good model fit (it tries to find values of β0 and β1 such that the resulting probabilities are closest to either 1 or 0)
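A minimal sketch of point a: a fitted coefficient maps to an odds ratio through exp(β1); the coefficient value here is made up, not taken from any model in this notebook.
# Minimal sketch: a one-unit increase in the feature multiplies the odds of y = 1 by exp(beta1)
import numpy as np
beta1 = 0.7                    # made-up fitted coefficient
odds_ratio = np.exp(beta1)
print('A one-unit increase multiplies the odds by', round(odds_ratio, 3))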
1. Akaike Information Criterion (AIC)
a. Counterpart of adjusted R-Square in multiple regression
b. The smaller, the better
c. Unlike R-Square, AIC does not automatically improve when more variables are added; its complexity penalty helps guard against overfitting
Note: Looking at the AIC of a single model in isolation doesn't tell you much. Build 2 or 3 logistic regression models and compare their AICs (see the sketch below)
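A minimal sketch of that comparison, assuming statsmodels is available; it fits two logistic models on made-up synthetic data (not the Titanic data used later) and compares their AICs.
# Minimal sketch (assumes statsmodels is installed): compare the AIC of two logistic models.
# The synthetic data is a made-up illustration; the model with the lower AIC is preferred.
import numpy as np
import statsmodels.api as sm
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
m1 = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)   # model with 1 feature
m2 = sm.Logit(y, sm.add_constant(X[:, :2])).fit(disp=0)   # model with 2 features
print('AIC with 1 feature :', m1.aic)
print('AIC with 2 features:', m2.aic)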
2. Null Deviance and Residual Deviance
a. The deviance of an observation is computed as -2 times its log likelihood
b. Null deviance comes from the null model, which predicts the class using only a constant (intercept-only) probability
c. Residual deviance is calculated from the model that includes all the features
d. Results (see the sketch below):
i. The larger the difference between the null and residual deviance, the better the model
ii. A low null deviance means the response can already be explained reasonably well by the intercept alone
iii. The lower the residual deviance, the better the model
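A minimal sketch of these quantities, again assuming statsmodels and made-up synthetic data: the null and residual deviances are -2 times the log-likelihood of the intercept-only model and the full model, respectively.
# Minimal sketch (assumes statsmodels is installed): null vs. residual deviance.
import numpy as np
import statsmodels.api as sm
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
null_deviance = -2 * result.llnull      # intercept-only model
residual_deviance = -2 * result.llf     # model with all the features
print('Null deviance     =', null_deviance)
print('Residual deviance =', residual_deviance)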
3. Confusion Matrix
              1 (Predicted)        0 (Predicted)
1 (Actual)    TP                   FN (Type 2 error)
0 (Actual)    FP (Type 1 error)    TN
Metrics:
Accuracy: It determines the overall predictive accuracy of the model.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate / Sensitivity / Recall: It indicates how many positive values, out of all the actual positive values, have been correctly predicted.
Sensitivity/Recall = TP / (TP + FN)
False Negative Rate = 1 - Sensitivity
True Negative Rate / Specificity: It indicates how many negative values, out of all the actual negative values, have been correctly predicted.
Specificity = TN / (TN + FP)
False Positive Rate = 1 - Specificity
Precision: It indicates how many values, out of all the predicted positive values, are actually positive.
Precision = TP / (TP + FP)
F Score: The F score is the harmonic mean of precision and recall.
It lies between 0 and 1; the higher the value, the better the model.
It is formulated as 2 * ((precision * recall) / (precision + recall)) (see the sketch below)
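A minimal sketch of all of the formulas above, computed from made-up confusion-matrix counts.
# Minimal sketch: the metrics above from made-up confusion-matrix counts
TP, FN, FP, TN = 50, 10, 5, 100    # illustrative values only
accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)            # sensitivity / true positive rate
specificity = TN / (TN + FP)       # true negative rate
precision = TP / (TP + FP)
f_score = 2 * (precision * recall) / (precision + recall)
print(accuracy, recall, specificity, precision, f_score)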
4. Receiver Operating Characteristic (ROC)
The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied. The model's overall discriminative ability is summarized by the Area Under the Curve (AUC).
Measure: the higher the area, the better the model (see the sketch below)
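A minimal sketch using scikit-learn's roc_curve and roc_auc_score on made-up labels and scores (not the Titanic model below), showing how the curve points and the AUC are obtained.
# Minimal sketch: ROC curve points and AUC with scikit-learn, on made-up labels and scores
from sklearn.metrics import roc_curve, roc_auc_score
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]   # predicted probabilities of class 1
fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one (FPR, TPR) point per threshold
print('AUC =', roc_auc_score(y_true, y_score))        # closer to 1 is better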
In [347]:
# Case Study Titanic Dataset:
# Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [430]:
# Importing the required training dataset
train= pd.read_csv("C:/Users/melvin/Machine Learning/Logistic Regression/train.csv")
In [431]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[431]:
Findings:
1. We are missing a lot of Cabin information
2. A lot of Age information is also missing
3. One Embarked value is missing
Solution:
1. For Age we can impute the missing values
2. For Cabin we can drop the column or transform it into a categorical variable like Known/Unknown
In [432]:
sns.set_style('whitegrid')
In [433]:
sns.countplot(x='Survived',hue='Sex',data=train)
Out[433]:
In [434]:
sns.countplot(x='Survived',hue='Pclass',data=train)
Out[434]:
In [435]:
sns.distplot(train['Age'].dropna(),bins=30,kde = False)
Out[435]:
In [436]:
train.info()
In [437]:
sns.countplot(x='SibSp',data=train,hue='Survived')
Out[437]:
In [438]:
Finding:
1. We can see that most of the people who had no siblings or one sibling aboard died. It's the opposite of what I thought.
In [439]:
'''
import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=50)
'''
Out[439]:
In [440]:
plt.figure(figsize=(10,7))
sns.boxplot(y='Age',x='Pclass',data=train)
Out[440]:
In [441]:
# Fill missing ages with a typical age for each passenger class (based on the boxplot above)
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)
In [442]:
plt.figure(figsize=(18,8))
sns.heatmap(train.isnull(),yticklabels=False,cmap='viridis')
Out[442]:
In [443]:
train.drop('Cabin',axis=1,inplace=True)
In [444]:
train.dropna(inplace=True)
In [445]:
# Converting the categorical variables into dummy variables
sex = pd.get_dummies(train['Sex'],drop_first=True)
sex.count()
Out[445]:
In [446]:
embark = pd.get_dummies(train['Embarked'],drop_first=True)
embark.count()
Out[446]:
In [447]:
train = pd.concat([train,sex,embark],axis=1)
train
Out[447]:
In [448]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
In [449]:
train.head()
Out[449]:
In [450]:
train.drop(['PassengerId'],axis=1,inplace=True)
In [451]:
train
Out[451]:
In [452]:
train.count()
Out[452]:
In [453]:
X = train.drop('Survived',axis=1)
y = train['Survived']
In [456]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.20,random_state = 1)
In [464]:
from sklearn.linear_model import LogisticRegression
Logistic_M = LogisticRegression(n_jobs=5)
In [465]:
Logistic_M.fit(X_train,y_train)
Out[465]:
In [466]:
predictions = Logistic_M.predict(X_test)
In [467]:
from sklearn.metrics import classification_report
In [468]:
print(classification_report(y_test,predictions))
In [469]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
Out[469]:
KNN is a classification algorithm.
How does it work? (see the sketch after the pros and cons below)
1. Compute the distance from the new point to every point in the training data
2. Select the K nearest points
3. Predict the majority class among those K neighbors
Pro's:
1. Very Simple
2. Training is trivial
3. Works with any number of classes
4. Easy to add more data
5. Few parameters:
a. K
b. Distance Metric
Con's :
1. High Prediction Cost
2. Not good with high dimensional data
3. Categorical features don't work well
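A minimal NumPy sketch of the distance-and-vote idea described under "How does it work?" above; the training points, labels, query point and K are made-up illustrations.
# Minimal sketch of the KNN idea: distance to every training point, then a majority vote.
# The training points, labels, query point and K are made-up illustrations.
import numpy as np
from collections import Counter
X_train_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9], [0.9, 1.1]])
y_train_toy = np.array([0, 0, 1, 1, 0])
query = np.array([1.1, 1.0])
k = 3
distances = np.linalg.norm(X_train_toy - query, axis=1)     # Euclidean distance metric
nearest = np.argsort(distances)[:k]                         # indices of the K closest points
vote = Counter(y_train_toy[nearest]).most_common(1)[0][0]   # majority class among them
print('Predicted class:', vote)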
KNN Use Case :
In [3]:
# Importing the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [5]:
# Importing the required dataset
data = pd.read_csv("C:/Users/melvin/Machine Learning/KNN/KNN_Project_Data.csv")
data.head()
Out[5]:
In [6]:
sns.pairplot(data=data)
Out[6]:
In [7]:
# Standardize the Variables
In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=data.columns[:-1])
df_feat.head()
Out[8]:
In [10]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features,data['TARGET CLASS'],test_size=0.30)
from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier()
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)
In [11]:
# Metrics
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))
In [12]:
# Evaluate KNN for a range of K values and record the test error rate for each
error_rate = []
# Will take some time
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
In [13]:
# Plot the error rate against K to choose a good value
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[13]:
In [18]:
from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier(n_neighbors=20)
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)
In [19]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))
Done!
In [ ]:
# Random Forest